Abstract

We study the worst case error of kernel density estimates via subset approximation. A kernel density
estimate of a distribution is the convolution of that distribution with a fixed kernel (e.g. Gaussian kernel).
Given a subset (i.e. a point set) of the input distribution, we can compare the kernel density estimate of
the input distribution with that of the subset and bound the worst case error. If the maximum error is $\varepsilon$,
then this subset can be thought of as an $\varepsilon$-sample (aka an $\varepsilon$-approximation) of the range space defined
with the input distribution as the ground set and the fixed kernel representing the family of ranges.
Interestingly, in this case the ranges are not binary, but have a continuous range (for simplicity we focus
on kernels with range of [0, 1]); these allow for smoother notions of range spaces.

It turns out that the use of this smoother family of range spaces has the added benefit of greatly decreasing
the size required for $\varepsilon$-samples. For instance, in the plane the size is $O((1/\varepsilon^{4/3}) \log^{2/3}(1/\varepsilon))$ for
disks (based on VC-dimension arguments) but is only $O((1/\varepsilon)\sqrt{\log(1/\varepsilon)})$ for Gaussian kernels and
for kernels with bounded slope that only affect a bounded domain. These bounds are accomplished by
studying the discrepancy of these "kernel" range spaces, and here the improvement in bounds is even
more pronounced. In the plane, we show the discrepancy is $O(\sqrt{\log n})$ for these kernels, whereas for
balls there is a lower bound of $\Omega(n^{1/4})$.



1 Introduction 



We study the $L_\infty$ error in kernel density estimates of point sets by a kernel density estimate of their subset.
Formally, we start with a size $n$ point set $P \subset \mathbb{R}^d$ and a kernel $K : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$. Then a kernel density
estimate $\mathrm{KDE}_P$ of a point set $P$ is a convolution of that point set with a kernel, defined at any point $x \in \mathbb{R}^d$:

$$\mathrm{KDE}_P(x) = \frac{1}{|P|} \sum_{p \in P} K(x,p).$$

The goal is to construct a subset $S \subset P$, and bound its size, so that it has $\varepsilon$-bounded $L_\infty$ error, i.e.

$$L_\infty(\mathrm{KDE}_P, \mathrm{KDE}_S) = \max_{x \in \mathbb{R}^d} |\mathrm{KDE}_P(x) - \mathrm{KDE}_S(x)| \le \varepsilon.$$

We call such a subset $S$ an $\varepsilon$-sample of a kernel range space $(P, \mathcal{K})$, where $\mathcal{K}$ is the set of all functions
$K(x, \cdot)$ represented by a fixed kernel $K$ and an arbitrary center point $x \in \mathbb{R}^d$. Our main result is the
construction in $\mathbb{R}^2$ of an $\varepsilon$-sample of size $O((1/\varepsilon)\sqrt{\log(1/\varepsilon)})$ for a broad variety of kernel range spaces.
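
To make the two definitions above concrete, the following minimal Python sketch evaluates a kernel density estimate and measures the gap between a point set and a subset. It assumes the Gaussian kernel $\exp(-\|x-p\|^2)$ and the $1/|P|$ normalization, and it only checks a finite list of query points, so the value it returns is a lower bound on the true $L_\infty$ error. The function names and the toy data are illustrative and do not come from the paper.

```python
import math

def gaussian_kernel(x, p):
    """K(x, p) = exp(-||x - p||^2), normalized so that K(p, p) = 1."""
    return math.exp(-sum((a - b) ** 2 for a, b in zip(x, p)))

def kde(points, x, kernel=gaussian_kernel):
    """Average kernel value at x over the point set."""
    return sum(kernel(x, p) for p in points) / len(points)

def linf_error_on_queries(P, S, queries, kernel=gaussian_kernel):
    """Max |KDE_P(x) - KDE_S(x)| over a finite query list.

    Assumption: a finite query list only lower-bounds the true maximum
    over all of R^d.
    """
    return max(abs(kde(P, x, kernel) - kde(S, x, kernel)) for x in queries)

# Toy example: four corners of the unit square and a two-point subset.
P = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
S = [(0.0, 0.0), (1.0, 1.0)]
queries = [(i / 4, j / 4) for i in range(5) for j in range(5)]
err = linf_error_on_queries(P, S, queries)
```

A subset $S$ would qualify as an $\varepsilon$-sample only if this quantity stayed at most $\varepsilon$ for every possible query point, not just the sampled ones.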

We will study this result through the perspective of three types of kernels. We use as examples the ball
kernels $\mathcal{B}$, the triangle kernels $\mathcal{T}$, and the Gaussian kernels $\mathcal{G}$; we normalize all kernels so $K(p,p) = 1$.

• For $K \in \mathcal{B}$: $K(x,p) = \{1$ if $\|x - p\| \le 1$ and $0$ otherwise$\}$.

• For $K \in \mathcal{T}$: $K(x,p) = \max\{0, 1 - \|x - p\|\}$.

• For $K \in \mathcal{G}$: $K(x,p) = \exp(-\|x - p\|^2)$.

Our main result holds for $\mathcal{T}$ and $\mathcal{G}$, but not $\mathcal{B}$. However, in the context of combinatorial geometry, kernels
related to $\mathcal{B}$ (binary ranges) seem to have been studied the most heavily from an $L_\infty$ error perspective, and
require larger $\varepsilon$-samples. In Appendix A we show a lower bound proving that such a result cannot hold for $\mathcal{B}$.

We re-describe this result next by adapting (binary) range spaces and discrepancy; these same notions 
will be used to prove our result. 



Range spaces. A kernel range space is an extension of the combinatorial concept of a range space. Let
$P \subset \mathbb{R}^d$ be a set of $n$ points. Let $\mathcal{A} \subset 2^P$ be a set of subsets of $P$; for instance, when $\mathcal{A} = \mathcal{B}$ they are
defined by containment in a ball. The pair $(P, \mathcal{A})$ is called a range space.

Thus we can re-imagine a kernel range space $(P, \mathcal{K})$ as the family of fractional subsets of $P$, that is, each
$p \in P$ does not need to be completely in (1) or not in (0) a range, but can be fractionally in a range described
by a value in [0, 1]. In the case of the ball kernel $K(x, \cdot) \in \mathcal{B}$ we say the associated range space is a binary
range space since all points have a binary value associated with each range, corresponding with in or not in.



Colorings and discrepancy. Let $\chi : P \to \{-1, +1\}$ be a coloring of $P$. The combinatorial discrepancy
of $(P, \mathcal{A})$, given a coloring $\chi$, is defined $d_\chi(P, \mathcal{A}) = \max_{R \in \mathcal{A}} |\sum_{p \in R} \chi(p)|$. For a kernel range space
$(P, \mathcal{K})$, this is generalized as the kernel discrepancy, defined $d_\chi(P, \mathcal{K}) = \max_{x \in \mathbb{R}^d} |\sum_{p \in P} \chi(p) K(x,p)|$; we
can also write $d_\chi(P, K_x) = |\sum_{p \in P} \chi(p) K(x,p)|$ for a specific kernel $K_x$; often the subscript $x$ is dropped
when it is apparent. Then the minimum kernel discrepancy of a kernel range space is defined $d(P, \mathcal{K}) =
\min_\chi d_\chi(P, \mathcal{K})$. See Matousek's [26] and Chazelle's [10] books for masterful treatments of this field when
restricted to combinatorial discrepancy.
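
As an illustration of the kernel discrepancy just defined, the sketch below evaluates $|\sum_{p} \chi(p) K(x,p)|$ for a given coloring and takes the maximum over a finite list of candidate centers. Restricting to finitely many centers is an assumption made here for simplicity (the definition ranges over all $x \in \mathbb{R}^d$), and the helper names are hypothetical.

```python
import math

def triangle_kernel(x, p):
    """Triangle kernel K(x, p) = max{0, 1 - ||x - p||}."""
    return max(0.0, 1.0 - math.dist(x, p))

def kernel_discrepancy(points, coloring, centers, kernel=triangle_kernel):
    """Max over the given centers x of |sum_p chi(p) K(x, p)|.

    `coloring` lists +1/-1 values in the same order as `points`.
    Assumption: only the supplied centers are examined.
    """
    return max(
        abs(sum(c * kernel(x, p) for p, c in zip(points, coloring)))
        for x in centers
    )

# Two nearby points with opposite colors nearly cancel at any center.
pts = [(0.0, 0.0), (0.1, 0.0)]
disc_opposite = kernel_discrepancy(pts, [+1, -1], centers=[(0.0, 0.0)])
disc_same = kernel_discrepancy(pts, [+1, +1], centers=[(0.0, 0.0)])
```

The example hints at why pairing close points with opposite signs helps: for a kernel with bounded slope, the two contributions differ by at most the slope times the distance between the points.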



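Constructing $\varepsilon$-samples. Given a (binary) range space $(P, \mathcal{A})$ an $\varepsilon$-sample (a.k.a. an $\varepsilon$-approximation)
is a subset $S \subset P$ such that the density of $P$ is approximated with respect to $\mathcal{A}$ so

$$\max_{R \in \mathcal{A}} \left| \frac{|R \cap P|}{|P|} - \frac{|R \cap S|}{|S|} \right| \le \varepsilon.$$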
Clearly, an $\varepsilon$-sample of a kernel range space is a direct generalization of the above defined $\varepsilon$-sample for a
(binary) range space. In fact, recently Joshi et al. [21] showed that for any kernel range space $(P, \mathcal{K})$ where
all super-level sets of kernels are described by elements of a binary range space $(P, \mathcal{A})$, an $\varepsilon$-sample of
$(P, \mathcal{A})$ is also an $\varepsilon$-sample of $(P, \mathcal{K})$. For instance, super-level sets of $\mathcal{G}, \mathcal{T}$ are balls in $\mathcal{B}$.

$\varepsilon$-Samples are a very common and powerful coreset for approximating $P$; the set $S$ can be used as a
proxy for $P$ in many diverse applications (c.f. [12, 30, 14, 31]). For binary range spaces with constant
VC-dimension [36] a random sample $S$ of size $O((1/\varepsilon^2) \log(1/\delta))$ provides an $\varepsilon$-sample with probability at
least $1 - \delta$ [24]. Better bounds can be achieved through deterministic approaches as outlined by Chazelle
and Matousek [12], or see either of their books for more details [10, 26]. This approach is based on the
following rough idea. Construct a low discrepancy coloring $\chi$ of $P$ and remove all points $p \in P$ such
that $\chi(p) = -1$. Then repeat these color-remove steps until only a small number of points are left (that
are always colored $+1$) and not too much error has accrued. As such, the best bounds for the size of
$\varepsilon$-samples are tied directly to discrepancy. As spelled out explicitly by Phillips [30] (see also [26, 10] for more
classic references), for a range space {P,A) with discrepancy 0(log^ |P|) (resp. 0{\P\^ log^ \ th^t can 
be constructed in time 0{\P\^ log"^(|P|)), there is an e-sample of size g{e) = 0((l/e) log^(l/e)) (resp. 
0(((l/e)log^(l/e))^/(^-'^)))thatcanbe constructed in time 0(u;"'-^n • (^(e))'"-^ • log'^(5(e)) + g{e)). 
Although, originally intended for binary range spaces, these results hold directly for kernel range spaces. 



1.1 Our Results 

Our main structural result is an algorithm for constructing a low-discrepancy coloring $\chi$ of a kernel range
space. The algorithm is relatively simple; we construct a min-cost matching of the points (one that minimizes the
sum of distances), and for each pair of points in the matching we color one point $+1$ and the other $-1$ at random.
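
The coloring procedure just described can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the implementation analyzed in the paper: the matching is found by an exponential-time dynamic program over subsets, which is only a stand-in for a polynomial-time min-cost matching algorithm and is usable for roughly twenty points at most. All function names are hypothetical.

```python
import math
import random
from functools import lru_cache

def min_cost_perfect_matching(points):
    """Exact min-cost perfect matching by bitmask dynamic programming.

    Assumption: brute-force stand-in, exponential in len(points).
    Returns (total_cost, tuple of index pairs).
    """
    n = len(points)
    if n % 2:
        raise ValueError("need an even number of points")

    @lru_cache(maxsize=None)
    def best(mask):
        if mask == 0:
            return 0.0, ()
        i = (mask & -mask).bit_length() - 1   # lowest unmatched index
        rest = mask ^ (1 << i)
        best_cost, best_pairs = math.inf, ()
        for j in range(i + 1, n):
            if (rest >> j) & 1:
                cost, pairs = best(rest ^ (1 << j))
                cost += math.dist(points[i], points[j])
                if cost < best_cost:
                    best_cost, best_pairs = cost, pairs + ((i, j),)
        return best_cost, best_pairs

    return best((1 << n) - 1)

def matching_coloring(points, rng):
    """Color one endpoint of each matched pair +1 and the other -1 at random."""
    _, pairs = min_cost_perfect_matching(points)
    coloring = [0] * len(points)
    for i, j in pairs:
        sign = rng.choice((-1, 1))
        coloring[i], coloring[j] = sign, -sign
    return coloring

pts = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.0, 5.1)]
cost, pairs = min_cost_perfect_matching(pts)
chi = matching_coloring(pts, random.Random(0))
```

Note that the coloring never looks at the kernel; only the matching and the independent coin flips per pair are used.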

Theorem 1.1. For $P \subset \mathbb{R}^d$ of size $n$, the above coloring $\chi$ has discrepancy $d_\chi(P, \mathcal{T}) = O(n^{1/2-1/d}\sqrt{\log(n/\delta)})$
and $d_\chi(P, \mathcal{G}) = O(n^{1/2-1/d}\sqrt{\log(n/\delta)})$ with probability at least $1 - \delta$.

This implies an efficient algorithm for constructing small $\varepsilon$-samples of kernel range spaces.

Theorem 1.2. For $P \subset \mathbb{R}^d$, with probability at least $1 - \delta$, we can construct in $O(n/\varepsilon^2)$ time an $\varepsilon$-sample
of $(P, \mathcal{T})$ or $(P, \mathcal{G})$ of size $O((1/\varepsilon)^{2d/(d+2)} \log^{d/(d+2)}(1/\varepsilon\delta))$.

Note that in M?, the size is 0((l/e)iyiog(l/e5)), near-linear in 1/e, and the runtime can be reduced to 
0{{n/y/e) log^(l/e)). Furthermore, for "B, the best known upper bounds for discrepancy (which are tight 
up to a log factor, see Appendix |a| are noticeably larger at 0((l/e)2"'/(''+^) • log'^/("'+^)(l/e)), especially 
in R2 at O ( ( 1 /e)4/3 log2/3 ( i /e) ) . 

We note that many combinatorial discrepancy results also use a matching where for each pair one is
colored $+1$ and the other $-1$. However, these matchings are "with low crossing number" and are not easy to
construct. For a long time these results were existential, relying on the pigeonhole principle. But, recently
Bansal [6] provided a randomized constructive algorithm; also see a similar, simpler and more explicit,
approach recently on the arXiv [25]. Yet these are still considerably more involved than our min-cost matching.
We believe that this simpler, and perhaps more natural, min-cost matching algorithm may be of independent
practical interest.



Proof overview. The proof of Theorem 1.2 follows from Theorem 1.1, the above stated results in [30],
and Edmonds' $O(n^3)$ time algorithm for min-cost matching $M$ [18]. So the main difficulty is proving
Theorem 1.1. We first outline this proof in $\mathbb{R}^2$ on $\mathcal{T}$. In particular, we focus on showing that the coloring $\chi$
derived from $M$, for any single kernel $K$, has $d_\chi(P, K) = O(\sqrt{\log(1/\delta)})$ with probability at least $1 - \delta$.
Then in Section 4.1 we extend this to an entire family of kernels with $d_\chi(P, \mathcal{K}) = O(\sqrt{\log(n/\delta)})$.

The key aspect of kernels required for the proof is a bound on their slope, and this is the critical difference
between binary range spaces (e.g. $\mathcal{B}$) and kernel range spaces (e.g. $\mathcal{T}$). On the boundary of a binary range
the slope is infinite, and thus small perturbations of $P$ can lead to large changes in the contents of a range;
all lower bounds for geometric range spaces seem to be inherently based on creating a lot of action along the
boundaries of ranges. For a kernel $K(x, \cdot)$ with slope bounded by $\sigma$, we use a specific variant of a Chernoff
bound (in Section 4) that will depend only on $\sum_j \Delta_j^2$, where $\Delta_j = K(x,p_j) - K(x,q_j)$ for each edge
$(p_j, q_j) \in M$. Note that $\sum_j |\Delta_j| \ge d_\chi(P, K)$ gives a bound on discrepancy, but analyzing this directly
gives a poly($n$) bound. Also, for binary kernels $\sum_j \Delta_j^2 = \mathrm{poly}(n)$, but if the kernel slope is bounded then
$\sum_j \Delta_j^2 \le \sigma^2 \sum_j \|p_j - q_j\|^2$. Then in Section 3 we bound $\sum_j \|p_j - q_j\|^2 = O(1)$ within a constant radius
ball, specifically the ball for which $K(x, \cdot) > 0$. This follows (after some technical details) from a result
of Bern and Eppstein [8].

Extending to $\mathcal{G}$ requires multiple invocations (in Section 4) of the $\sum_j \Delta_j^2$ bound from Section 3. Extending
to $\mathbb{R}^d$ basically requires generalizing the matching result to depend on the sum of distances to the $d$th
power (in Section 3) and applying Jensen's inequality to relate $\sum_j \Delta_j^d$ to $\sum_j \Delta_j^2$ (in Section 4).

1.2 Motivation 

Near-linear sized $\varepsilon$-samples. Only a limited family of range spaces are known to have $\varepsilon$-samples with
size near-linear in $1/\varepsilon$. Most reasonable range spaces in $\mathbb{R}^1$ admit an $\varepsilon$-sample of size $1/\varepsilon$ by just sorting the
points and taking every $\varepsilon|P|$th point in the sorted order. However, near-linear results in higher dimensions
are only known for range spaces defined by axis-aligned rectangles (and other variants defined by fixed, 
but non necessarily orthogonal axes) (c.f. 041 1301 ). All results based on VC-dimension admit super-linear 
polynomials in $1/\varepsilon$, with powers approaching 2 as $d$ increases. And random sampling bounds, of course,
only provide $\varepsilon$-samples of size $O(1/\varepsilon^2)$.
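
The one-dimensional construction mentioned above (sort, then keep every $\varepsilon|P|$th point) is short enough to sketch directly. The check below uses one-sided intervals $(-\infty, t]$ as the range space; that choice, the helper names, and the toy input are assumptions made for illustration only.

```python
import math

def one_dim_eps_sample(points, eps):
    """Sort the points and keep every floor(eps * n)-th one."""
    step = max(1, math.floor(eps * len(points)))
    return sorted(points)[step - 1::step]

def max_prefix_error(points, sample):
    """Largest gap between the fractions of points and of sample in (-inf, t].

    Assumption: it suffices to test thresholds t at the input points,
    since the fractions only change there.
    """
    worst = 0.0
    for t in points:
        frac_p = sum(1 for p in points if p <= t) / len(points)
        frac_s = sum(1 for s in sample if s <= t) / len(sample)
        worst = max(worst, abs(frac_p - frac_s))
    return worst

P = list(range(1, 101))
S = one_dim_eps_sample(P, 0.1)
error = max_prefix_error(P, S)
```

With 100 points and $\varepsilon = 0.1$ the sample has $1/\varepsilon = 10$ points, and the measured error on these ranges stays below $\varepsilon$.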

This polynomial distinction is quite important since for small $\varepsilon$ (i.e. with $\varepsilon = 0.001$, which is important
for summarizing large datasets) then $1/\varepsilon^2$ (in our example the size $1/\varepsilon^2 = 1{,}000{,}000$) is so large it often
defeats the purpose of summarization. Furthermore, most techniques other than random sampling (size
$1/\varepsilon^2$) are quite complicated and rarely implemented (many require Bansal's recent result [6] or its
simplification [25]). One question explored in this paper is: what other families of ranges have $\varepsilon$-samples with size
near-linear in $1/\varepsilon$?
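
To give a rough sense of the gap for $\varepsilon = 0.001$, the following arithmetic compares $1/\varepsilon^2$ with $(1/\varepsilon)\sqrt{\log(1/\varepsilon)}$. The constants hidden by the $O(\cdot)$ notation are ignored here, which is an assumption; the numbers indicate orders of magnitude only.

```python
import math

eps = 0.001
# Random sampling bound, ignoring constants and the log(1/delta) factor.
random_sampling_size = 1 / eps ** 2
# Near-linear bound for kernels in the plane, ignoring constants.
near_linear_size = (1 / eps) * math.sqrt(math.log(1 / eps))
ratio = random_sampling_size / near_linear_size
```

Under these simplifications the first quantity is one million while the second is a few thousand, a ratio of several hundred.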

This question has gotten far less attention recently than e-nets, a related and weaker summary. In that 
context, a series of work ||20l HI |27l |23l S HH has shown that size bound of 0((l/e) log(l/e)) based on 
VC-dimension can be improved to 0((l/e) loglog(l/e)) or better in some cases. Super-linear lower bounds 
are known as well fT,'29'|. We believe the questions regarding e-samples are even more interesting because 
they can achieve polynomial improvements, rather than at best logarithmic improvements. 

$L_\infty$ kernel density estimates. Much work (mainly in statistics) has studied the approximation properties
of kernel density estimates; this grew out of the desire to remove the choice of where to put the breakpoints
between bins in histograms. A large focus of this work has been in determining which kernels recover the
original functions the best and how accurate the kernel density approximation is. Highlights are the books
of Silverman [33] on $L_2$ error in $\mathbb{R}^1$, Scott [32] on $L_2$ error in $\mathbb{R}^d$, and books by Devroye, Gyorfi, and/or
Lugosi [15, 17, 16] on $L_1$ error. We focus on $L_\infty$ error, which has not generally been studied.

Typically the error is from two components: the error from use of a subset (between $\mathrm{KDE}_P$ and $\mathrm{KDE}_S$),
and the error from convolving the data with a kernel (between $P$ and $\mathrm{KDE}_P$). Here we ignore the second
part, since it can be arbitrarily large under the $L_\infty$ error. Typically in the first part, only random samples have
been considered; we improve over these random sample bounds.

Recently Chen, Welling, and Smola [13] showed that for any positive definite kernel (including $\mathcal{G}$, but not
$\mathcal{T}$ or $\mathcal{B}$) a greedy process produces a subset of points $S \subset P$ such that $L_2(\mathrm{KDE}_S, \mathrm{KDE}_P) \le \varepsilon$ when $S$ is size
$|S| = O(1/\varepsilon)$. This improves on standard results from random sampling theory that require $|S| = O(1/\varepsilon^2)$
for such error guarantees. This result helped inspire us to seek a similar result under $L_\infty$ error.

Relationship with binary range spaces. Range spaces have typically required all ranges to be binary.
That is, each $p \in P$ is either completely in or completely not in each range $R \in \mathcal{A}$. Kernel range spaces were
defined in a paper last year 1211 (although similar concepts such as fat-shattering dimension |[22l appear in 
the learning theory literature |l7l[T6l|4l|33> their results were not as strong as [21] requiring larger subsets 
S, and they focus on random sampling). That paper showed that an e-sample for balls is also an e-sample 
for a large number of kernel range spaces. An open question from that paper is whether that relationship
goes the other direction: is an $\varepsilon$-sample of a kernel range space also necessarily an $\varepsilon$-sample for a binary
range space? We answer this question in the negative; in particular, we show near-linear sized $\varepsilon$-samples for
kernel range spaces when it is known the corresponding binary range space must have super-linear size.

This raises several questions: are these binary range spaces which require size super-linear in $1/\varepsilon$ really
necessary for downstream analysis? Can we simplify many analyses by using kernels in place of binary
ranges? One possible target is the quite fascinating, but enormous, bounds for bird-flocking [11].

2 Preliminaries 

For simplicity, we focus on only rotation-and-shift-invariant kernels so $K(p_i, p_j) = k(\|p_i - p_j\|)$ can be
written as a function of just the distance between its two arguments. The rotation-invariance constraint can
easily be removed, but would complicate the technical presentation. We also assume the kernels have been
scaled so $K(p,p) = k(0) = 1$. Section 5 discusses removing this assumption.

We generalize the family of kernels $\mathcal{T}$ (see above) to $\mathcal{S}_\sigma$ which we call $\sigma$-bounded; they have slope
bounded by a constant $\sigma > 0$ (that is for any $x, q, p \in \mathbb{R}^d$ then $|K(x,p) - K(x,q)| / \|q - p\| \le \sigma$),
and bounded domain $B_x = \{p \in \mathbb{R}^d \mid K(x,p) > 0\}$ defined for any $x \in \mathbb{R}^d$. For a rotation-and-
shift-invariant kernel, $B_x$ is a ball, and for simplicity we assume the space has been scaled so $B_x$ has
radius 1 (this is not critical, as we will see in Section 5, since as the radius increases, $\sigma$ decreases). In
addition to $\mathcal{T}$ this class includes, for instance, the Epanechnikov kernels $\mathcal{E}$, where $K(x, \cdot) \in \mathcal{E}$ is defined
$K(x,p) = \max\{0, 1 - \|x - p\|^2\}$.

We can also generalize $\mathcal{G}$ (see above) to a family of exponential kernels such that $|k(z)| \le \exp(-|\mathrm{poly}(z)|)$
and has bounded slope $\sigma$; this would also include, for instance, the Laplacian kernel. For simplicity, for the
remainder of this work we focus technically on the Gaussian kernel.

Let $\mathrm{rad}(B)$ denote the radius of a ball $B$. The $d$-dimensional volume of a ball $B$ of radius $r$ is denoted
$\mathrm{vol}_d(r) = (\pi^{d/2} / \Gamma(d/2 + 1)) r^d$ where $\Gamma(z)$ is the gamma function; note $\Gamma(z)$ is increasing with $z$ and when
$z$ is a positive integer $\Gamma(z) = (z - 1)!$. For balls of radius 1 we set $\mathrm{vol}_d(1) = V_d$, a constant depending only
on $d$. In general $\mathrm{vol}_d(r) = V_d r^d$. Note that our results hold in $\mathbb{R}^d$ where $d$ is assumed constant, and $O(\cdot)$
notation will absorb terms dependent only on $d$.
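
For readers who want to sanity-check the volume formula, a short Python sketch follows. It uses only `math.gamma` from the standard library; the function names are illustrative.

```python
import math

def unit_ball_volume(d):
    """V_d = pi^(d/2) / Gamma(d/2 + 1), the volume of the unit ball in R^d."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def ball_volume(d, r):
    """vol_d(r) = V_d * r^d."""
    return unit_ball_volume(d) * r ** d

area_unit_disk = unit_ball_volume(2)       # equals pi
volume_unit_ball_3d = unit_ball_volume(3)  # equals 4*pi/3
```

For $d = 2$ and $d = 3$ the formula reduces to the familiar $\pi r^2$ and $\frac{4}{3}\pi r^3$.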

3 Min-Cost Matchings within Balls 

Let $P \subset \mathbb{R}^d$ be a set of $2n$ points. We say $M(P)$ (or $M$ when the choice of $P$ is clear) is a perfect matching
of $P$ if it defines a set of $n$ unordered pairs $(q, p)$ of points from $P$ such that every $p \in P$ is in exactly one
pair. We define the cost of a perfect matching in terms of the sum of distances: $c(M) = \sum_{(q,p) \in M} \|q - p\|$.
The min-cost perfect matching is the matching $M^* = \arg\min_M c(M)$.

The proof of the main result is based on a lemma that relates the density of matchings in $M^*$ to the volume
of a ball containing them. For any ball $B \subset \mathbb{R}^d$, define the length $\rho(B, M)$ of the matchings from $M$ within
$B$ as:

$$\rho(B, M) = \sum_{(q,p) \in M} \begin{cases} \|q - p\|^d & \text{if } q, p \in B \\ \|q - p_B\|^d & \text{if } q \in B \text{ and } p \notin B, \text{ where } p_B \text{ is the intersection of } \overline{qp} \text{ with } \partial B \\ 0 & \text{if } q, p \notin B \end{cases}$$

Lemma 3.1. There exists a constant $\phi_d$ that depends only on $d$, such that for any ball $B \subset \mathbb{R}^d$ the length
$\rho(B, M^*) \le (\phi_d / V_d)\, \mathrm{vol}_d(\mathrm{rad}(B)) = \phi_d\, \mathrm{rad}(B)^d$.

Proof. A key observation is that we can associate each edge $(q,p) \in M^*$ with a shape that has $d$-dimensional
volume proportional to $\|q - p\|^d$, and if these shapes for each edge do not overlap too much, then the sum of
edge lengths to the $d$th power is at most proportional to the volume of the ball that contains all of the points.

In fact, Bern and Eppstein [8] perform just such an analysis considering the minimum cost matching of
points within $[0, 1]^d$. They bound the sum of the $d$th power of the edge lengths to be $O(1)$. For an edge
$(q,p)$, the shape they consider is a football, which includes all points $y$ such that $qyp$ makes an angle of
at most 170°; so it is thinner than the disk with diameter from $q$ to $p$, but still has $d$-dimensional volume
$\Omega(\|q - p\|^d)$.

To apply this result of Bern and Eppstein, we can consider a ball of radius 1/2 that fits inside of $[0, 1]^d$
by scaling down all lengths uniformly by $1/(2\,\mathrm{rad}(B))$. If we show the sum of $d$th power of edge lengths is
$O(1)$ for this ball, then by scaling back we can achieve our more general result. From now on we will assume
$\mathrm{rad}(B) = 1/2$. Now the sum of $d$th power of edge lengths where both endpoints are within $B$ is at most
$O(1)$ since these points are also within $[0, 1]^d$.

Handling edges of the second type where $p \notin B$ is a bit more difficult. Let $B$ be centered at a point $x$.
First, we can also handle all "short" edges $(q,p)$ where $\|x - p\| \le 10$ since we can again just scale these
points appropriately losing a constant factor, absorbed in $O(1)$.

Next, we argue that no more than $360^{d-1}$ "long" edges $(q,p) \in M^*$ can exist with $q \in B$ and $\|p - x\| >
10$. If there were more, then there would be two such edges $(q,p)$ and $(q',p')$ where the angle between the vectors from
$x$ to $p$ and from $x$ to $p'$ differ by at most 1 degree. We can now demonstrate that both of these edges cannot
be in the minimum cost matching as follows.




Let $10 < \|p - x\| \le \|p' - x\|$ and let $y$ be the point a distance 10 from $x$ on the ray towards $p'$. We
can now see the following simple geometric facts: (F1) $\|q - q'\| < 1$, (F2) $\|p - y\| < \|p - x\|$, (F3)
$\|p' - y\| + 10 = \|p' - x\|$, (F4) $\|p - y\| + \|p' - y\| \ge \|p - p'\|$, and (F5) $\|p - x\| - 1/2 < \|p - q\|$ and
$\|p' - x\| - 1/2 < \|p' - q'\|$. It follows that

$$\|q - q'\| + \|p - p'\| < 1 + \|p - y\| + \|p' - y\| \qquad \text{via (F1) and (F4)}$$
$$\le 1 + \|p - x\| + \|p' - x\| - 10 \qquad \text{via (F2) and (F3)}$$
$$< \|p - q\| + \|p' - q'\| \qquad \text{via (F5)}$$

Thus it would be lower cost to swap some pair of edges if there are more than $360^{d-1}$ of them. Since each
of these at most $360^{d-1}$ long second-type edges can contribute at most 1 to $\rho(B, M^*)$ this handles the last
remaining class of edges, and proves the lemma. □



4 Small Discrepancy for Kernel Range Spaces 

We construct a coloring $\chi : P \to \{-1, +1\}$ of $P$ by first constructing the minimum cost perfect matching
$M^*$ of $P$, and then for each $(p_j, q_j) \in M^*$ we randomly color one of $\{p_j, q_j\}$ as $+1$ and the other $-1$.

To bound the discrepancy of a single kernel $d_\chi(P, K)$ we consider a random variable $X_j = \chi(p_j) K(x, p_j) +
\chi(q_j) K(x, q_j)$ for each pair $(p_j, q_j) \in M^*$, so $d_\chi(P, K) = |\sum_j X_j|$. We also define a value $\Delta_j =
2|K(x, p_j) - K(x, q_j)|$ such that $X_j \in \{-\Delta_j/2, \Delta_j/2\}$. The key insight is that we can bound $\sum_j \Delta_j^2$ using
the results from Section 3. Then since each $X_j$ is an independent random variable and has $E[X_j] = 0$, we
are able to apply a Chernoff bound on $d_\chi(P, K) = |\sum_j X_j|$ that says

$$\Pr[d_\chi(P, K) > \alpha] \le 2 \exp\left(\frac{-2\alpha^2}{\sum_j \Delta_j^2}\right). \qquad (4.1)$$

In $\mathbb{R}^2$, for $\sigma$-bounded kernels achieving a probabilistic discrepancy bound is quite straightforward at this
point (using Lemma 3.1); and can be achieved for Gaussian kernels with a bit more work.
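
As a numerical companion to the tail bound, the sketch below reads (4.1) as the Hoeffding-type inequality $\Pr[|\sum_j X_j| > \alpha] \le 2\exp(-2\alpha^2 / \sum_j \Delta_j^2)$ for independent, mean-zero $X_j$ confined to an interval of width $\Delta_j$. That reading, the helper names, and the sample widths are assumptions made for illustration.

```python
import math

def chernoff_tail(alpha, widths):
    """Upper bound 2 * exp(-2 alpha^2 / sum_j Delta_j^2) on Pr[|sum X_j| > alpha]."""
    return 2.0 * math.exp(-2.0 * alpha ** 2 / sum(w * w for w in widths))

def alpha_for_failure(delta, widths):
    """Smallest alpha for which the bound above equals the failure probability delta."""
    return math.sqrt(sum(w * w for w in widths) * math.log(2.0 / delta) / 2.0)

# Eight matched pairs, each with Delta_j = 0.5 (hypothetical values).
widths = [0.5] * 8
alpha = alpha_for_failure(0.05, widths)
bound = chernoff_tail(alpha, widths)
```

The threshold depends on the data only through $\sum_j \Delta_j^2$, which is why bounding that sum by a constant yields a discrepancy bound independent of $n$ for a single kernel.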

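However for points $P \subset \mathbb{R}^d$ for $d > 2$ applying the above bound is not as efficient since we only have a
bound on $\sum_j \Delta_j^d$. We can attain a weaker bound using Jensen's inequality over at most $n$ terms

$$\left(\frac{1}{n} \sum_j \Delta_j^2\right)^{d/2} \le \frac{1}{n} \sum_j \Delta_j^d \quad \text{so we can state} \quad \sum_j \Delta_j^2 \le n^{1-2/d} \left(\sum_j \Delta_j^d\right)^{2/d}. \qquad (4.2)$$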



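$\sigma$-bounded kernels. We start with the result for $\sigma$-bounded kernels using Lemma 3.1.

Lemma 4.1. In $\mathbb{R}^d$, for any kernel $K \in \mathcal{S}_\sigma$ we can construct a coloring $\chi$ such that $\Pr[d_\chi(P, K) >
n^{1/2-1/d} \sigma (\phi_d)^{1/d} \sqrt{2 \ln(2/\delta)}] \le \delta$ for any $\delta > 0$.

Proof. Consider some $K(x, \cdot) \in \mathcal{S}_\sigma$ and recall $B_x = \{y \in \mathbb{R}^d \mid K(x, y) > 0\}$. Note that

$$\sum_j \Delta_j^d = \sum_j 2^d |K(p_j, x) - K(q_j, x)|^d \le \sigma^d 2^d \rho(B_x, M^*) \le \sigma^d 2^d \phi_d\, \mathrm{rad}(B_x)^d \le \sigma^d 2^d \phi_d,$$

where the first inequality follows by the slope of $K \in \mathcal{S}_\sigma$ being at most $\sigma$ and $K(p, x) = 0$ for $p \notin B_x$,
since we can replace $p_j$ with $p_{j,B_x}$ when necessary since both have $K(x, \cdot) = 0$, and the second inequality
follows by Lemma 3.1. Hence, by Jensen's inequality (i.e. (4.2)) $\sum_j \Delta_j^2 \le n^{1-2/d} (\sigma^d 2^d \phi_d)^{2/d} =
n^{1-2/d} \sigma^2 4 (\phi_d)^{2/d}$.

We now study the random variable $d_\chi(P, K) = |\sum_j X_j|$ for a single $K \in \mathcal{S}_\sigma$. Invoking (4.1) we can
bound $\Pr[d_\chi(P, K) > \alpha] \le 2 \exp(-\alpha^2 / (n^{1-2/d}\, 2 \sigma^2 (\phi_d)^{2/d}))$. Setting $\alpha = n^{1/2-1/d} \sigma (\phi_d)^{1/d} \sqrt{2 \ln(2/\delta)}$
reveals $\Pr[d_\chi(P, K) > n^{1/2-1/d} \sigma (\phi_d)^{1/d} \sqrt{2 \ln(2/\delta)}] \le \delta$. □

For $\mathcal{T}$ and $\mathcal{E}$ the bound on $\sigma$ is 1 and 2, respectively. Also note that in $\mathbb{R}^2$ the expected discrepancy for
any one kernel is independent of $n$.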

Gaussian kernels. Now we extend the above result for $\sigma$-bounded kernels to Gaussian kernels. It requires
a nested application of Lemma 3.1. Let $z_i = 1/2^i$ and $B_i = \{p \in \mathbb{R}^d \mid K(x,p) \ge z_i\}$; let $A_i = B_i \setminus B_{i-1}$
be the annulus of points with $K(x,p) \in [z_i, z_{i-1})$. For simplicity define $B_0$ as empty and $A_1 = B_1$. We
can bound the slopes within each annulus ylj as ui = 1 = max ^{—k{y)), and more specifically for i > 2 

then cTj = maXy(zA^ = ^i^i-i) = 1/2*"^ Define yi = \/iln2 so that k{yi) = Zi. 



We would like to replicate Lemma 3.1 but to bound $\rho(B, M^*)$ for annuli $A_i$ instead of $B$. However,
this cannot be done directly because an annulus can become skinny, so its $d$-dimensional volume is not
proportional to its width to the $d$th power. This in turn allows for long edges within an annulus that do not
correspond to a large amount of volume in the annulus.

We can deal with this situation by noting that each edge either counts towards a large amount of volume
towards the annulus directly, or could have already been charged to the ball contained in the annulus. This
will be sufficient for the analysis that follows for Gaussian kernels.

Define $\rho(A_i, M)$ as follows for an annulus centered at point $x$. For each edge $(p_j, q_j) \in M$ let $q_{j,i}$ be the
point on $\overline{p_j q_j} \cap A_i$ furthest from $x$; if there is a tie, choose one arbitrarily. Let $p_{j,i}$ be the point on $\overline{p_j q_j} \cap A_i$
closest to $x$; if there is a tie, choose the one closer to $q_{j,i}$. Then $\rho(A_i, M) = \sum_{(p_j, q_j) \in M} \|q_{j,i} - p_{j,i}\|^d$.

Lemma 4.2. $\rho(A_i, M) \le \rho(B_i, M) - \rho(B_{i-1}, M)$ for $i \ge 1$.

Proof. Each edge $(p_j, q_j) \in M$ such that $\overline{p_j q_j}$ intersects $B_i$ has $\overline{p_j q_j} \cap B_i$ either entirely, partially,
or not at all in $A_i$. Those that are entirely in or not at all in $A_i$ contribute exclusively to $\rho(A_i, M)$ or
$\rho(B_{i-1}, M)$, respectively. Those that are partially in $A_i$ contribute to $A_i$ and $B_{i-1}$ in a superadditive way
towards $B_i$, leading to the claimed inequality. □



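Now notice that since $y_i = \sqrt{i \ln 2}$ then $\rho(B_i, M^*) \le \phi_d (\ln 2)^{d/2} i^{d/2}$, by Lemma 3.1.

Lemma 4.3. Let $\Psi(n, d, \delta) = O(n^{1/2-1/d} \sqrt{\ln(1/\delta)})$. In $\mathbb{R}^d$, for any kernel $K \in \mathcal{G}$ we can construct a
coloring $\chi$ such that $\Pr[d_\chi(P, K) > \Psi(n, d, \delta)] \le \delta$ for any $\delta > 0$.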



Proof. Now to replicate the result in Lemma 4.1 we need to show that Yl{pj p'.)eM* i^iQj) ~ ^{Pj)Y < ^ 
for some constant C. For a kernel -fC G S centered at a point x G M'^ we can decompose its range of 
influence into a series of annuli Ai, A2, ■ ■ ■ , as described above. 

Define Vi^j = K{pj^i) — K{qj^i) if both pj^i and qj^i exist, otherwise set j = 0. Note that j < 
\\Pj,i ~ since the slope is bounded by l/2*~^ in A^. 

For any {pj,qj) G M* we claim K{qj) — K{pj) < Ylii/^i,i — 3maxjfjj. The first inequality follows 
since the slope within annulus Ai is bounded by l/2*~^, so each Vij corresponds to the change in value of 
{pj, qj) with Ai. To see the second inequality, let £ be the smallest index such that K{x,pj) > ze, that is 
A£ is the smallest annulus that intersects qjp]. We can see that f^+ij > since (t/j — yi-i) is 

a decreasing function of i and 1/2*^^ is geometrically decreasing. Thus the argmaxj Vij is either vej or 
V£+ij, since Vij = for i < £. If v^j > ve+ij then 3vij > vij + vi+ij + = Si^i "^ij- If 

Vij < Vi+ij then 3v£+ij > v^j + v^+ij + X]^£+2 = Zli^i '^hj- Hence 

{K{qj)-K{pj)f < {Y,v^,f < (3maxr;,,y < 3'Y.<r 

i i 

We can now argue that the sum over all Ai, that q )eM* '^fj bounded. By Lemma 

summing over all {pj, qj) in a fixed Ai 

V = V IIPj-.^ -g^-.^ll'' = pj^i^M*) piBi,M*)-p{Bi.uM*) 

{pj,qj)GM* {Pj,qj)(^M* 

Hence (using a reversal in the order of summation) 



4.2 



and 



00 00 

{pj,qj)eM" (pj,gj)eA/* i=l 4=1 (pj,g^)eAf 



i=l i=l 



< 
since for i > 1 and d > 2 the term {i In 2)^/^/2^^* < 1/2* is geometrically decreasing as a function of i.



Finally we plug = 2\K{qj) — K{pj)\ and using (4.2i 



into a Chernoff bound ( 4. 1 1 on n pairs in a matching, as in the proof of Lemma 4. 1 where Xj represents the 
contribution to the discrepancy of pair {qj,Pj) G M* and d^{P, K) = \ Xj\. Then 

PrK(P, A-) > H < 2exp < 2exp (^^^^^^,^) . 

Setting a = ^{n, d, 6) = 6V2{(pdy/'^n'^/'^-^/'^y^ln{2/5) reveals Pr[d^(P, iC) > ^(n, d, 5)] < 6. □ 
Again, in $\mathbb{R}^2$ we can show the expected discrepancy for any one Gaussian kernel is independent of $n$.

4.1 From a Single Kernel to a Range Space, and to $\varepsilon$-Samples

The above theorems imply colorings with small discrepancy ($O(1)$ in $\mathbb{R}^2$) for an arbitrary choice of $K \in \mathcal{T}$
or $\mathcal{E}$ or $\mathcal{G}$, and give a randomized algorithm to construct such a coloring that does not use information about
the kernel. But this does not yet imply small discrepancy for all choices of $K \in \mathcal{S}_\sigma$ or $\mathcal{G}$ simultaneously.
To do so, we show we only need to consider a polynomial in $n$ number of kernels, and then show that the
discrepancy is bounded for all of them.

Note for binary range spaces, this result is usually accomplished by deferring to VC-dimension $\nu$, where
there are at most $n^\nu$ distinct subsets of points in the ground set that can be contained in a range. Unlike for
binary range spaces, this approach does not work for kernel range spaces since even if the same set $P_x \subset P$
has non-zero $K(x,p)$ for $p \in P_x$, the discrepancy $d_\chi(P, K_x)$ may change by recentering the kernel to $x'$
such that $P_x = P_{x'}$. Instead, we use the bounded slope (with respect to the size of the domain) of the kernel.

For a kernel $K \in \mathcal{S}_\sigma$ or $\mathcal{G}$, let $B_{x,n} = \{p \in \mathbb{R}^d \mid K(x,p) > 1/n\}$.

Lemma 4.4. For any $x \in \mathbb{R}^d$, $\mathrm{vol}_d(B_{x,n}) \le V_d$ for $\mathcal{S}_\sigma$ and $\mathrm{vol}_d(B_{x,n}) = V_d (\ln(2n))^{d/2}$ for $\mathcal{G}$.

Proof. For $\mathcal{S}_\sigma$, clearly $B_{x,n} \subset B_x = \{p \in \mathbb{R}^d \mid K(x,p) > 0\}$, and $\mathrm{vol}_d(B_x) = V_d$.

For $\mathcal{G}$, we have $k(z) = 1/n$ for $z = \sqrt{\ln(2n)}$, and a ball of radius $\sqrt{\ln(2n)}$ has $d$-dimensional volume
$V_d (\ln(2n))^{d/2}$.

Theorem 4.1. In $\mathbb{R}^d$, for $\mathcal{K}$ as a family in $\mathcal{S}_\sigma$ or $\mathcal{G}$, and a value $\Psi(n, d, \delta) = O(n^{1/2-1/d} \sqrt{\log(n/\delta)})$, we
can choose a coloring $\chi$ such that $\Pr[d_\chi(P, \mathcal{K}) > \Psi(n, d, \delta)] \le \delta$ for any $\delta > 0$.

Proof. Each $p \in P$ corresponds to a ball $B_{p,n}$ where $K(p, \cdot) \ge 1/n$. Let $U = \bigcup_{p \in P} B_{p,n}$. For any $q \notin U$,
then $\sum_{p \in P} K(p, q) \le 1$, and thus $d_\chi(P, K_q) \le 1$; we can ignore these points.

Also, by Lemma 4.4 the $d$-dimensional volume of $U$ is at most $V_d n$ for $\mathcal{S}_\sigma$ and at most $V_d n (\ln(2n))^{d/2}$
for $\mathcal{G}$. We can then cover $U$ with a net $N_\tau$ such that for each $x \in U$, there exists some $q \in N_\tau$ such that
$\|x - q\| \le \tau$. Based on the $d$-dimensional volume of $U$, there exists such a net of size $|N_\tau| \le O(\tau^{-d} n)$ for
$\mathcal{S}_\sigma$ and $|N_\tau| \le O(\tau^{-d} n (\ln(2n))^{d/2})$ for $\mathcal{G}$.

The maximum slope of a kernel $K \in \mathcal{G}$ is $\sigma = 1$. Then, if for all $q \in N_\tau$ we have $d_\chi(P, K_q) \le D$ (where
$D \ge 1$), then any point $x \in \mathbb{R}^d$ has discrepancy at most $d_\chi(P, K_x) \le D + \tau n \sigma$. Recall for any $x \notin U$,
$d_\chi(P, K_x) \le 1$. Then the $\tau n \sigma$ term follows since by properties of the net we can compare discrepancy to
a kernel $K_q$ shifted by at most $\tau$, and thus this affects the kernel values on $n$ points each by at most $\sigma\tau$.
Setting $\tau = 1/(n\sigma)$ it follows that $d_\chi(P, K_x) \le D + 1$ for any $x \in \mathbb{R}^d$. Thus for this condition to hold, it
suffices for $|N_\tau| = O(n^{d+1} \sigma^d)$ for $\mathcal{S}_\sigma$ and $|N_\tau| = O(n^{d+1} (\ln(2n))^{d/2})$ for $\mathcal{G}$.



Setting the probability of failure in Lemma 4.1 for each such kernel to $\delta' = \Omega(\delta / |N_\tau|)$ implies that
for some value $\Psi(n, d, \delta) = n^{1/2-1/d} \sigma (\phi_d)^{1/d} \sqrt{2 \ln(2/\delta')} + 1 = O(n^{1/2-1/d} \sqrt{\log(n/\delta)})$, for $\mathcal{K}$ as $\mathcal{S}_\sigma$ or
$\mathcal{G}$, the $\Pr[d_\chi(P, \mathcal{K}) > \Psi(n, d, \delta)] \le \delta$. □

To transform this discrepancy algorithm into one for $\varepsilon$-samples, we need to repeat it successfully $O(\log n)$
times. Thus to achieve failure probability at most $\varphi$ we can set $\delta = \varphi / \log n$ above, and get a discrepancy of at most
$\sqrt{2 \log((n/\varphi) \log n)}$ with probability $1 - \varphi$. Now this and [30] imply Theorem 1.2. We can state the corollary

below using Varadarajan's 0{n^'^ log^ n) time algorithm |]37l| for computing the min-cost matching in M?; 
the log factor in the runtime can be improved using better SSPD constructions IB to 0(n^'^ log^ n) expected 
time or 0{n^'^ log^ n) deterministic time. 

Corollary 4.1. For any point set P C M^, for any class % of a -bounded kernels or Gaussian kernels, 
in 0{{n/ ^/e) log^(l/e)) expected time (or in 0{{n/^/e) log'^(l/e)) deterministic time) we can create an 
e-sample of size 0((l/e) Y^log(l/e0))/or {P, %), with probability at least 1 — (p. 



5 Extensions 

Bandwidth scaling. A common topic in kernel density estimates is fixing the integral under the kernel
(usually at 1) while the "bandwidth" $w$ is scaled. For a shift-and-rotation-invariant kernel, where the default
kernel $k$ has bandwidth 1, a kernel with bandwidth $w$ is written and defined $k_w(z) = (1/w^d) k(z/w)$.
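
A small Python sketch of this rescaling follows, assuming the Gaussian profile $k(z) = \exp(-z^2)$ as the default kernel; the function names are illustrative. It shows how the value at distance zero grows as the bandwidth shrinks.

```python
import math

def gaussian_profile(z):
    """Default radial profile k(z) = exp(-z^2), with k(0) = 1."""
    return math.exp(-z * z)

def scaled_profile(z, w, d, k=gaussian_profile):
    """Bandwidth-w kernel k_w(z) = (1 / w^d) * k(z / w) in dimension d."""
    return k(z / w) / w ** d

peak_default = scaled_profile(0.0, 1.0, 2)   # bandwidth 1 in the plane
peak_narrow = scaled_profile(0.0, 0.5, 2)    # halving w multiplies k_w(0) by 2^d
```

Since $k_w(0) = 1/w^d$, the normalization $K(p,p) = 1$ assumed earlier no longer holds once $w \ne 1$.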

Our results do not hold for arbitrarily small bandwidths, because then $k_w(0) = 1/w^d$ becomes arbitrarily
large as $w$ shrinks; see Appendix A. However, fix $W$ as some small constant, and consider all
kernels $\mathcal{K}_W$ extended from $\mathcal{K}$ to allow any bandwidth $w \ge W$. We can construct an $\varepsilon$-sample of size
0((l/e)^/^^-'-/'^A/log(l/e(5)) for {P,%y/)- The observation is that increasing zii to tw' where 1 < w'/w < r] 
affects dy,{P, k^) by at most 0{nri'^~^^) since a and W are assumed constant. Thus we can expand A'^^ by a 
factor of 0{n'^~^^ log n); for each x G N-j- consider 0(n"'+^ log n) different bandwidths increasing geomet- 
rically by (1 + a/n^^^) until we reach n^/'^^^^). For w > n^/('^+^^ the contribution of each of n points to 
dy^{P, Kx) is at most 1/n, so the discrepancy at any x is at most 1. Nt is still poly(n), and hence Theorem 



4.1 still holds for %w in place of some % G {Scr, S}- 
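To see why a geometric grid of bandwidths keeps the net polynomial in $n$, one can count the steps directly. The sketch below assumes a step ratio of $1 + \sigma/n^{d+1}$ and an upper cutoff of $n^{1/(d+1)}$; these values, the defaults $\sigma = 1$ and $W = 0.1$, and the function name `num_bandwidths` are illustrative assumptions rather than constants taken from the analysis.

```python
import math

def num_bandwidths(n, d, sigma=1.0, W=0.1):
    """Count geometric steps from W up to n**(1/(d+1)) with ratio 1 + sigma / n**(d+1)."""
    log_ratio = math.log1p(sigma / n ** (d + 1))
    top = n ** (1.0 / (d + 1))
    return math.ceil(math.log(top / W) / log_ratio)

d = 2
for n in (10, 20, 40):
    steps = num_bandwidths(n, d)
    # The last column compares the count against n^(d+1) * ln(n).
    print(n, steps, round(steps / (n ** (d + 1) * math.log(n)), 3))
```

Under these assumptions the count stays within a constant factor of $n^{d+1}\ln n$, so multiplying the net size by it leaves the net polynomial in $n$.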

Another perspective on bandwidth scaling is how it affects the constants in the bound directly. The main
discrepancy result (Theorem 4.1) would have a discrepancy bound $O(r\sigma n^{1/2-1/d}\sqrt{d\log(n/\delta)})$, where $r$
is the radius of the region of the kernel where $K(p,\cdot)$ is not sufficiently small, and $\sigma$ is the
maximum slope. Note that $k(0) \leq r\sigma$. Assume $\sigma$ and $r$ represent the appropriate values when $w = 1$.
Then as we change $w$ and fix the integral under $K_w(x,\cdot)$ at 1, differentiating $k_w(z) = (1/w^d)k(z/w)$ gives
$\sigma_w = \sigma/w^{d+1}$, and $r_w = rw$, so $r_w\sigma_w = r\sigma/w^{d}$. Thus as the bandwidth $w$ decreases, the discrepancy
bound increases in proportion to $1/w^{d}$.
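The scaling of the slope and radius can be checked numerically from the definition $k_w(z) = (1/w^d)k(z/w)$. The sketch below uses a one-dimensional triangle kernel ($d = 1$, with $\sigma = 1$ and $r = 1$ at $w = 1$) as an assumed example; `max_slope` and `support_radius` are hypothetical helpers that estimate $\sigma_w$ and $r_w$ on a grid.

```python
def triangle(z):
    """Unit-bandwidth 1-D triangle kernel: maximum slope 1, support radius 1."""
    return max(0.0, 1.0 - abs(z))

def scaled(k, w, d=1):
    """Return k_w(z) = (1 / w**d) * k(z / w)."""
    return lambda z: k(z / w) / (w ** d)

def max_slope(f, lo=-4.0, hi=4.0, steps=80_000):
    """Largest absolute finite-difference slope of f on a uniform grid."""
    h = (hi - lo) / steps
    return max(abs(f(lo + (i + 1) * h) - f(lo + i * h)) / h for i in range(steps))

def support_radius(f, hi=4.0, steps=80_000):
    """Largest grid value z in [0, hi] where f(z) is still positive."""
    h = hi / steps
    return max((i * h for i in range(steps + 1) if f(i * h) > 0.0), default=0.0)

results = {}
for w in (0.5, 1.0, 2.0):
    k_w = scaled(triangle, w)
    sigma_w, r_w = max_slope(k_w), support_radius(k_w)
    results[w] = (sigma_w, r_w, sigma_w * r_w)
    print(w, round(sigma_w, 4), round(r_w, 4), round(sigma_w * r_w, 4))
```

For this kernel the estimated slope behaves like $1/w^{2}$ and the radius like $w$, so their product behaves like $1/w$, matching $r\sigma/w^{d}$ with $d = 1$.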

Alternatively we can fix $k(0) = 1$ and scale $k_s(z) = k(z/s)$. This trades off $r_s = rs$ and $\sigma_s = \sigma/s$, so
$r_s\sigma_s = r\sigma$ is fixed for any value $s$. Again, we can increase $N_\tau$ by a factor $O(n\log n)$ and cover all values $s$.

Lower bounds. We present several straightforward lower bounds in Appendix A. In brief we show:

• random sampling cannot do better on kernel range spaces than on binary range spaces (size $O(1/\varepsilon^2)$);

• $\varepsilon$-samples for kernels require $\Omega(1/\varepsilon)$ points;

• $\varepsilon$-samples for $(P,\mathcal{B})$ require size $\Omega(1/\varepsilon^{2d/(d+1)})$ in $\mathbb{R}^d$; and

• kernels with $k(0)$ arbitrarily large can have unbounded discrepancy.
5.1 Future Directions



We believe we should be able to make this algorithm deterministic using iterative reweighting of the $\mathrm{poly}(n)$
kernels in $N_\tau$. We suspect that in $\mathbb{R}^2$ this would yield discrepancy $O(\log n)$ and an $\varepsilon$-sample of size $O((1/\varepsilon)\log(1/\varepsilon))$.

We are not sure that the bounds in $\mathbb{R}^d$ for $d > 2$ require $\varepsilon$-samples of size super-linear in $1/\varepsilon$. The polynomial
increase is purely from Jensen's inequality. Perhaps a matching that directly minimizes the sum of lengths to
the $d$th or $(d/2)$th power will yield better results, but these are harder to analyze. Furthermore, we have
not attempted to optimize for, or solve for, any constants. A version of Bern and Eppstein's result [8] with
explicit constants would be of some interest.

We provide results for classes of $\sigma$-bounded kernels $\mathcal{S}_\sigma$, Gaussian kernels $\mathcal{G}$, and ball kernels $\mathcal{B}$. In fact
this covers all classes of kernels described on the Wikipedia page on statistical kernels:
http://en.wikipedia.org/wiki/Kernel_(statistics). However, we believe we should in principle be
able to extend our results to all shift-invariant kernels with bounded slope. This can include kernels which
may be negative (such as the sinc or trapezoidal kernels [15, 11]) and have nice $L_2$ KDE approximation
properties. These sometimes-negative kernels cannot use the $\varepsilon$-sample result from [21] because their super-level
sets have unbounded VC-dimension, since $k(z) = 0$ for infinitely many disjoint values of $z$.

Edmonds' min-cost matching algorithm runs in $O(n^3)$ time in $\mathbb{R}^d$, and Varadarajan's improvement [37]
runs in $O(n^{1.5}\log^{5} n)$ in $\mathbb{R}^2$ (and can be improved to $O(n^{1.5}\log^{2} n)$ [1]). However, $\varepsilon$-approximate
algorithms [38] run in $O((n/\varepsilon^{3})\log^{6} n)$ in $\mathbb{R}^2$. As this result governs the runtime for computing a coloring,
and hence an $\varepsilon$-sample, it would be interesting to see if even a constant-factor approximation could
attain the same asymptotic discrepancy results; this would imply an $\varepsilon$-sample algorithm that runs in time
$O(n \cdot \mathrm{poly}\log(1/\varepsilon))$ for $1/\varepsilon < n$.
A Lower Bounds



We state here a few straightforward lower bounds. These results are not difficult, and are often left implied,
but we provide them here for completeness.

Random sampling. We can show that in $\mathbb{R}^d$ for a constant $d$, a random sample of size $O((1/\varepsilon^2)\log(1/\delta))$
is an $\varepsilon$-sample for all shift-invariant, non-negative, non-increasing kernels (via the improved bound [24] on
VC-dimension-bounded range spaces [36], since they describe super-level sets of these kernels using [21]).
We can also show that we cannot do any better, even for one kernel $K(x,\cdot)$. Consider a set $P$ of $n$ points
where $n/2$ points $P_a$ are located at location $a$, and $n/2$ points $P_b$ are located at location $b$. Let $\|a - b\|$ be
large enough that $K(a,b) < 1/n^2$ (for instance for $K \in \mathcal{T}$ let $K(a,b) = 0$). Then for an $\varepsilon$-sample $S \subset P$
of size $k$, we require that $S_a \subset P_a$ has size $k_a$ such that $|k_a - k/2| < \varepsilon k$. The probability of failure satisfies
(via Proposition 7.3.2 [28] for $\varepsilon < 1/8$)

$\Pr[|S_a| > k/2 + \varepsilon k] \geq \frac{1}{15}\exp(-16(\varepsilon k)^2/k) = \frac{1}{15}\exp(-16\varepsilon^2 k) \geq \delta.$

Solving for $k$ reveals $k \leq (1/(16\varepsilon^2))\ln(1/(15\delta))$. Thus if $k = o((1/\varepsilon^2)\log(1/\delta))$ the random sample will
have more than $\delta$ probability of having more than $\varepsilon$-error.
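The threshold on $k$ and the anti-concentration bound can be evaluated directly. The sketch below assumes each sampled point lands at $a$ or $b$ independently with probability $1/2$ (sampling with replacement from a large $P$), and the function names are hypothetical. It computes the largest $k$ for which the stated lower bound still exceeds $\delta$ and compares the bound against a simulated failure rate.

```python
import math
import random

def lower_bound_failure(eps, k):
    """(1/15) * exp(-16 * eps^2 * k), the stated lower bound, valid for eps < 1/8."""
    return math.exp(-16.0 * eps * eps * k) / 15.0

def max_failing_size(eps, delta):
    """Largest k for which the lower bound above is still at least delta."""
    return math.log(1.0 / (15.0 * delta)) / (16.0 * eps * eps)

def empirical_failure(eps, k, trials=20_000, seed=0):
    """Simulated frequency of the event |S_a| > k/2 + eps*k over independent samples."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        k_a = sum(rng.random() < 0.5 for _ in range(k))
        bad += k_a > k / 2 + eps * k
    return bad / trials

eps, delta = 0.1, 0.01
print(max_failing_size(eps, delta))
print(lower_bound_failure(eps, 10), empirical_failure(eps, 10))
```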

Requirement of $1/\varepsilon$ points for $\varepsilon$-samples. Consider a set $P$ where there are only $t = \lceil 1/\varepsilon \rceil - 1$
distinct locations of points, each containing $|P|/t$ points from $P$. Then we can consider a range for each of
the distinct points that contains only those points, or a kernel $K(x_i,\cdot)$ for $i \in [t]$ that is non-zero (or
non-negligible) only for the points in the $i$th location. Thus if any distinct location in $P$ is not represented in the
$\varepsilon$-sample $S$, then the query on the range/kernel registering just those points would have error $1/t$, which is greater than $\varepsilon$.
Thus an $\varepsilon$-sample must have size at least $\lceil 1/\varepsilon \rceil - 1$.
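This construction is small enough to check directly. The sketch below represents each point by the index of its location and uses an indicator kernel that registers only location 0; the function name `missing_location_error` and the default of 100 points per location are assumptions for illustration, and it requires $\varepsilon < 1/2$ so that at least two locations exist.

```python
import math

def missing_location_error(eps, points_per_location=100):
    """Error of the query registering only location 0 when a sample omits that location."""
    t = math.ceil(1.0 / eps) - 1
    P = [i for i in range(t) for _ in range(points_per_location)]
    S = [p for p in P if p != 0]            # a sample with no point at location 0
    frac_P = sum(p == 0 for p in P) / len(P)
    frac_S = sum(p == 0 for p in S) / len(S)
    return t, abs(frac_P - frac_S)

for eps in (0.1, 0.25, 0.3):
    t, err = missing_location_error(eps)
    print(eps, t, err, err > eps)
```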

Minimum $\varepsilon$-sample for balls. We can achieve a lower bound for the size of $\varepsilon$-samples in terms of $\varepsilon$ for
any range space $(P,\mathcal{A})$ by leveraging known lower bounds from discrepancy. For instance, for $(P,\mathcal{B})$ it is
known (see [26] for many such results) that there exist point sets $P \subset \mathbb{R}^d$ of size $n$ such that
$d(P,\mathcal{B}) = \Omega(n^{1/2-1/(2d)})$. To translate this to a lower bound for the size of an $\varepsilon$-sample $S$ on $(P,\mathcal{B})$ we follow a
technique outlined in Lemma 1.6 [26]. An $\varepsilon$-sample of size $n/2$ implies for an absolute constant $C$ that

$C n^{1/2-1/(2d)} \leq d(P,\mathcal{B}) \leq \varepsilon n.$

Hence, solving for $n/2$ (the size of our $\varepsilon$-sample) in terms of $\varepsilon$ reveals that $|S| = n/2 \geq (1/2)(C/\varepsilon)^{2d/(d+1)}$.
Hence, for $\varepsilon$ small enough that $|S| \geq n/2$, the size of an $\varepsilon$-sample for $(P,\mathcal{B})$ is $\Omega(1/\varepsilon^{2d/(d+1)})$.
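The algebra in the last step can be verified numerically: at $n = (C/\varepsilon)^{2d/(d+1)}$ the two sides of $C n^{1/2-1/(2d)} \leq \varepsilon n$ meet. In the sketch below the constant $C$ is unspecified in the argument, so `C=1.0` is only a placeholder, and the function names are hypothetical.

```python
import math

def threshold_n(eps, d, C=1.0):
    """Smallest n satisfying C * n**(1/2 - 1/(2d)) <= eps * n."""
    return (C / eps) ** (2.0 * d / (d + 1))

def ball_sample_lower_bound(eps, d, C=1.0):
    """Resulting bound |S| = n/2 >= (1/2) * (C/eps)**(2d/(d+1))."""
    return 0.5 * threshold_n(eps, d, C)

for d in (2, 3, 10):
    eps = 0.01
    n = threshold_n(eps, d)
    lhs = n ** (0.5 - 1.0 / (2 * d))   # C * n^(1/2 - 1/(2d)) with C = 1
    rhs = eps * n
    print(d, round(ball_sample_lower_bound(eps, d), 2), math.isclose(lhs, rhs))
```

For $d = 2$ the exponent is $4/3$, and as $d$ grows it approaches 2, the exponent obtained by random sampling.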

Delta kernels. We make the assumption that $K(x,x) = 1$; if instead $K(x,x) = \eta$ and $\eta$ is arbitrarily
large, then the discrepancy $d(P,K) = \Omega(\eta)$. We can place one point $p \in P$ far enough from all other points
$p' \in P$ so $K(p,p')$ is arbitrarily small. Now $d_\chi(P, K(p,\cdot))$ approaches $\eta$, no matter whether $\chi(p) = +1$ or
$\chi(p) = -1$.
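A small numerical illustration of this effect follows. It assumes the signed discrepancy at a query $x$ is $|\sum_{p} \chi(p) K(x,p)|$ and uses a one-dimensional kernel $K(x,p) = \eta \exp(-|x-p|^2)$; the function name, the cluster layout, and the value of $\eta$ are all illustrative choices.

```python
import math
import random

def signed_discrepancy(points, colors, x, eta):
    """|sum_p chi(p) * K(x, p)| with K(x, p) = eta * exp(-|x - p|^2), so K(x, x) = eta."""
    return abs(sum(c * eta * math.exp(-(x - p) ** 2) for p, c in zip(points, colors)))

rng = random.Random(1)
cluster = [rng.random() for _ in range(50)]   # points packed into [0, 1)
isolated = 1000.0                             # one point far from all the others
points = cluster + [isolated]
eta = 1e6

values = []
for sign in (+1, -1):
    colors = [rng.choice((-1, 1)) for _ in cluster] + [sign]
    values.append(signed_discrepancy(points, colors, isolated, eta))
    print(sign, values[-1])
```

Whichever color the isolated point receives, the query centered on it sees essentially only that point, so the discrepancy there is about $\eta$.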